Protein condensation diseases: therapeutic opportunities

Condensed states of proteins, including liquid-like membraneless organelles and solid-like aggregates, contribute in fundamental ways to the organisation and function of the cell. Perturbations of these states can lead to a variety of diseases through mechanisms that we are now beginning to understand. We define protein condensation diseases as conditions caused by the disruption of the normal behaviour of the condensed states of proteins. We analyze the problem of the identification of targets for pharmacological interventions for these diseases and explore opportunities for the regulation of the formation and organisation of aberrant condensed states of proteins.

Furthermore, we analyzed 644,521 disease-associated missense variants in human proteins in the Human Variants Database 3 , finding that 145,924 missense mutations in experimentally identified condensate components and 232,866 variants in predicted droplet-forming proteins are associated with human diseases. Since about one third of these mutations are in regions predicted to drive droplet formation 1 , we estimate that over 100,000 disease-associated missense mutations may alter condensate properties, shift the phase boundary, or promote aggregation. In addition, for 327 proteins forming membraneless organelles and for nearly 800 predicted droplet-forming proteins, over 70% of disease-associated missense mutations are in regions that likely drive condensate formation (Supplementary data set: Table S3). The accumulation of missense mutations in droplet-promoting regions suggests that disease-causing mutations alter condensate properties.
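As a rough check of this estimate, assuming that the two variant counts refer to non-overlapping protein sets and that the one-third fraction applies to their sum, the arithmetic can be written out as:

```latex
\[
\frac{145{,}924 + 232{,}866}{3} = \frac{378{,}790}{3} \approx 126{,}263 > 100{,}000
\]
```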

Datasets of droplet-forming proteins
Based on available experimental evidence, we compiled the droplet-forming human proteins that are components of membraneless organelles (MLOs). These proteins were either observed to undergo spontaneous liquid-liquid phase separation (droplet-drivers) or were identified by high-throughput studies as components of cellular condensates (droplet-clients). Droplet-driver proteins were collected from public databases (PhaSepDB (http://db.phasep.pro) 4 , PhaSePro (https://phasepro.elte.hu) 5 , LLPSDB (http://biocomp.org.cn/llpsdb) 6 ) in a previous study 1 , and complemented by new cases in the updated PhaSepDB v2 4 . Proteins classified as 'PS-SELF' were defined as droplet-drivers. Proteins classified as 'PS-OTHER' and components of membraneless organelles identified in high-throughput studies by organelle purification 7,8 , affinity purification 9,10 , immunofluorescence image-based screens 11,12 , and proximity labelling 13,14 were assembled as droplet-clients. The human MLO dataset contained 4434 proteins: 462 droplet-drivers and 3972 droplet-clients (Supplementary data set: Table S1). We note that most droplet-forming proteins are also amyloid-forming proteins 15,16 .
As more proteins may be expected to drive phase separation than those currently deposited in public datasets, we also assembled proteins predicted to undergo liquid-liquid phase separation using FuzDrop (pLPS ≥ 0.60) 1 . In addition to the experimentally identified condensate-forming proteins, we assembled 5757 predicted human droplet-driver proteins (PC: predicted condensates, Supplementary data set: Table S1) from UniProt (May 2021).

Datasets of non-condensate proteins
The 10635 proteins in UniProt (May 2021) that have neither been experimentally observed to undergo liquid-liquid phase separation, nor been identified as components of membraneless organelles, nor been predicted to drive condensate formation were considered non-condensate-forming proteins. These proteins may nevertheless contain droplet-promoting regions that facilitate their partitioning into condensates.
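The three-way grouping described above (MLO, PC, NONE) can be illustrated with a minimal sketch. The function name, the input containers and the toy accessions are hypothetical; only the pLPS ≥ 0.60 cut-off comes from the text. The sketch also assumes that experimental evidence takes precedence over prediction, so that the three classes do not overlap.

```python
# Minimal sketch of the MLO / PC / NONE grouping; all names are illustrative.
PLPS_THRESHOLD = 0.60  # FuzDrop cut-off stated in the text


def classify_protein(accession, mlo_accessions, plps_scores):
    """Return 'MLO', 'PC' or 'NONE' for a protein accession.

    mlo_accessions: set of proteins with experimental condensate evidence.
    plps_scores: dict mapping accession -> predicted pLPS score.
    Assumption: experimental evidence takes precedence over prediction.
    """
    if accession in mlo_accessions:
        return "MLO"
    # Proteins without a score are treated as not predicted (assumption).
    if plps_scores.get(accession, 0.0) >= PLPS_THRESHOLD:
        return "PC"
    return "NONE"


# Toy usage with made-up accessions and scores
mlo = {"PROT_A"}
scores = {"PROT_A": 0.95, "PROT_B": 0.60, "PROT_C": 0.59}
classes = {acc: classify_protein(acc, mlo, scores)
           for acc in ["PROT_A", "PROT_B", "PROT_C", "PROT_D"]}
```

Here PROT_B falls into the PC class because the threshold is inclusive, while PROT_D, which has no score, is assigned to NONE.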

Disease-gene associations
Disease-gene associations were derived from the DisGeNet database (http://disgenet.org) 2 . We analyzed 9277 diseases associated with protein-coding genes derived from curated resources (UniProt, Comparative Toxicogenomics Database, Orphanet, Clinical Genome Resource, Genomics England PanelApp, Cancer Genome Interpreter and Psychiatric disorders Gene association Network) and 21552 diseases associated with protein-coding genes when inferred data were also included (Human Phenotype Ontology, ClinVar and genome-wide association studies NHGRI-EBI-GWAS). We only analyzed pathologies defined as disease, and not phenotypes or groups. Diseases were named as in the database, using MSH classifications.

Missense variants affecting protein condensates
We analyzed 644,521 missense mutations of 17450 human proteins in the Human Variants Database (HuVarBase), which assembles mutational data from 1000 Genomes, ClinVar, COSMIC, SwissVar, and Humsavar. Disease names in HuVarBase follow the Genetic Testing Registry (https://www.ncbi.nlm.nih.gov/gtr/), and we considered all the disease associations of the missense variants. We grouped missense variants according to the condensate-forming ability of the corresponding proteins: proteins identified as components of membraneless organelles (MLO), proteins predicted to form condensates (PC), and proteins not known to form condensates (NONE).

Ranking of protein condensation diseases
To identify diseases with major contributions from mutations affecting condensate properties, we used two approaches. In the first, based on disease-gene associations in the DisGeNet database, we collected all the protein-coding genes associated with a given disease. We then determined the contributions of genes encoding experimentally observed droplet-forming proteins (MLO) and genes encoding predicted condensates (PC) by calculating the fraction of genes encoding droplet-forming proteins, fDROP = (nMLO + nPC)/nTOT, where nTOT is the total number of genes associated with the disease. We ranked the diseases based on the fraction of genes encoding droplet-forming proteins that form membraneless organelles (Supplementary data set: Table S2). This ranking evaluated the relative contributions of genes encoding condensate-forming and non-condensate-forming proteins, and could identify those diseases in which the associated droplet-coding genes make major contributions (Supplementary data set: Table S2). Diseases were classified according to the standard classification system (MSH as defined in the DisGeNet database).
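The gene-based ranking can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used for Table S2: the function name, the input layout (disease-gene pairs plus a gene-to-class mapping) and the tie-breaking rule are hypothetical, and genes without a class are counted as NONE.

```python
from collections import defaultdict


def rank_diseases_by_fdrop(associations, gene_class):
    """Rank diseases by fDROP = (nMLO + nPC) / nTOT over associated genes.

    associations: iterable of (disease, gene) pairs.
    gene_class: dict mapping gene -> 'MLO', 'PC' or 'NONE'.
    Returns a list of (disease, fDROP, nTOT), highest fDROP first.
    """
    counts = defaultdict(lambda: {"MLO": 0, "PC": 0, "NONE": 0})
    seen = set()
    for disease, gene in associations:
        if (disease, gene) in seen:  # count each gene once per disease
            continue
        seen.add((disease, gene))
        # Genes missing from the mapping are treated as NONE (assumption).
        counts[disease][gene_class.get(gene, "NONE")] += 1

    ranking = []
    for disease, c in counts.items():
        n_tot = c["MLO"] + c["PC"] + c["NONE"]
        f_drop = (c["MLO"] + c["PC"]) / n_tot
        ranking.append((disease, f_drop, n_tot))
    # Ties are broken by gene count, then by name (illustrative choice).
    ranking.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return ranking


# Toy usage with made-up diseases and genes
pairs = [("D1", "A"), ("D1", "B"), ("D1", "C"), ("D1", "D"),
         ("D2", "A"), ("D2", "E")]
classes = {"A": "MLO", "B": "PC", "C": "NONE", "D": "NONE", "E": "MLO"}
ranking = rank_diseases_by_fdrop(pairs, classes)
```

In this toy input, D2 ranks first because both of its genes encode droplet-forming proteins, whereas only two of the four D1 genes do.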
In the second approach, we determined, for each disease, the number of missense variants of proteins that form membraneless organelles (nMLO), the number of missense variants of proteins that are predicted to form condensates (nPC) and the number of missense variants of proteins that are not known to form condensates (nNONE). We then computed the fraction of missense mutations in droplet-forming proteins, fDROP = (nMLO + nPC)/nTOT, where nTOT is the total number of missense variants associated with the disease. We ranked the diseases based on the fraction of missense mutations in membraneless organelle component proteins and predicted condensate-forming proteins. In this analysis we only ranked diseases associated with proteins in which most missense mutations fall into droplet-promoting regions (Supplementary data set: Table S3).
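A variant-level version of the same calculation, including the restriction to droplet-promoting regions, might look like the sketch below. The record layout, the function name and the 0.5 cut-off are assumptions made for illustration. In particular, "most missense mutations" is read here as more than half of a disease's variants, evaluated per disease rather than per protein, which is only one possible interpretation of the criterion.

```python
def rank_diseases_by_variant_fdrop(variants, min_dpr_fraction=0.5):
    """Rank diseases by the fraction of missense variants in droplet-forming proteins.

    variants: iterable of (disease, protein_class, in_dpr) tuples, where
      protein_class is 'MLO', 'PC' or 'NONE' and in_dpr is True when the
      variant lies in a predicted droplet-promoting region.
    min_dpr_fraction: diseases are kept only if the fraction of their
      variants in droplet-promoting regions exceeds this value (assumption).
    """
    stats = {}
    for disease, protein_class, in_dpr in variants:
        s = stats.setdefault(disease, {"MLO": 0, "PC": 0, "NONE": 0, "dpr": 0})
        s[protein_class] += 1
        if in_dpr:
            s["dpr"] += 1

    ranked = []
    for disease, s in stats.items():
        n_tot = s["MLO"] + s["PC"] + s["NONE"]
        if s["dpr"] / n_tot <= min_dpr_fraction:
            continue  # most variants fall outside droplet-promoting regions
        f_drop = (s["MLO"] + s["PC"]) / n_tot
        ranked.append((disease, f_drop))
    ranked.sort(key=lambda r: (-r[1], r[0]))
    return ranked


# Toy usage with made-up records
records = [("D1", "MLO", True), ("D1", "PC", True), ("D1", "NONE", False),
           ("D2", "MLO", False), ("D2", "NONE", False)]
ranked = rank_diseases_by_variant_fdrop(records)
```

With these toy records, D2 is excluded because none of its variants lie in a droplet-promoting region, and D1 is retained with fDROP = 2/3.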

Pathways associated with protein condensation diseases
We analyzed the biochemical pathways enriched in disease-associated droplet-forming proteins. In particular, using the search tools of the STRING database 17 , we computed the enrichment of Gene Ontology biological processes and molecular functions, KEGG pathways and WikiPathways among the disease-associated genes (Supplementary data set: Table S2) encoding membraneless organelle-forming proteins and predicted condensates (Supplementary data set: Table S4).
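For readers unfamiliar with enrichment analysis, the underlying idea can be illustrated with a generic over-representation test based on the hypergeometric distribution. This is not the STRING implementation, whose statistics and multiple-testing correction may differ; the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_annotated, n_selected, n_overlap):
    """One-sided hypergeometric p-value for over-representation.

    n_background: genes in the background set (N).
    n_annotated: background genes carrying the term or pathway (K).
    n_selected: genes in the query list, e.g. disease-associated
      droplet-forming genes (n).
    n_overlap: query genes carrying the term (k).
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    upper = min(n_annotated, n_selected)
    favourable = sum(
        comb(n_annotated, i) * comb(n_background - n_annotated, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_background, n_selected)


# Toy usage: 10 background genes, 5 annotated, 5 selected, all 5 annotated
p_value = hypergeom_enrichment_p(10, 5, 5, 5)
```

A small p-value indicates that the term is carried by more query genes than expected by chance; in practice such values would need to be corrected for the number of terms tested.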